Unsupervised and Knowledge-Free Learning of Compound Splits and Periphrases

نویسندگان

  • Florian Holz
  • Christian Biemann
چکیده

We present an approach for knowledge-free and unsupervised recognition of compound nouns for languages that use one-wordcompounds such as Germanic and Scandinavian languages. Our approach works by creating a candidate list of compound splits based on the word list of a large corpus. Then, we filter this list using the following criteria: (a) frequencies of compounds and parts, (b) length of parts. In a second step, we search the corpus for periphrases, that is a reformulation of the (single-word) compound using the parts and very high frequency words (which are usually prepositions or determiners). This step excludes spurious candidate splits at cost of recall. To increase recall again, we train a trie-based classifier that also allows splitting multipart-compounds iteratively. We evaluate our method for both steps and with various parameter settings for German against a manually created gold standard, showing promising results above 80% precision for the splits and about half of the compounds periphrased correctly. Our method is language independent to a large extent, since we use neither knowledge about the language nor other language-dependent preprocessing tools. For compounding languages, this method can drastically alleviate the lexicon acquisition bottleneck, since even rare or yet unseen compounds can now be periphrased: the analysis then only needs to have the parts described in the lexicon, not the compound itself.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Unsupervised Compound Splitting With Distributional Semantics Rivals Supervised Methods

In this paper we present a word decompounding method that is based on distributional semantics. Our method does not require any linguistic knowledge and is initialized using a large monolingual corpus. The core idea of our approach is that parts of compounds (like “candle” and “stick”) are semantically similar to the entire compound, which helps to exclude spurious splits (like “candles” and “t...

متن کامل

An Unsupervised Learning Method for an Attacker Agent in Robot Soccer Competitions Based on the Kohonen Neural Network

RoboCup competition as a great test-bed, has turned to a worldwide popular domains in recent years. The main object of such competitions is to deal with complex behavior of systems whichconsist of multiple autonomous agents. The rich experience of human soccer player can be used as a valuable reference for a robot soccer player. However, because of the differences between real and simulated soc...

متن کامل

Annotated Suffix Tree Method for German Compound Splitting

The paper presents an unsupervised and knowledge-free approach to compound splitting. Although the research is focused on German compounds, the method is expected to be extensible to other compounding languages. The approach is based on the annotated suffix tree (AST) method proposed and modified by Mirkin et al. To the best of our knowledge, annotated suffix trees have not yet been used for co...

متن کامل

Deep Unsupervised Domain Adaptation for Image Classification via Low Rank Representation Learning

Domain adaptation is a powerful technique given a wide amount of labeled data from similar attributes in different domains. In real-world applications, there is a huge number of data but almost more of them are unlabeled. It is effective in image classification where it is expensive and time-consuming to obtain adequate label data. We propose a novel method named DALRRL, which consists of deep ...

متن کامل

Unsupervised and Knowledge-free Morpheme Segmentation and Analysis

This paper presents a revised version of an unsupervised and knowledge-free morpheme boundary detection algorithm based on letter successor variety (LSV) and a trie classifier (Bordag, 2006a). Additionally a morphemic analysis based on contextual similarity provides knowledge about relatedness of the found morphs. For the boundary detection the challenge of increasing recall of found morphs whi...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2008